Abstract

We consider the solution of stochastic dynamic programs using sample path 
estimates. Applying the theory of large deviations, we derive probability error 
bounds associated with the convergence of the estimated optimal policy to 
the true optimal policy, for finite horizon problems. These bounds decay 
at an exponential rate, in contrast with the usual canonical (inverse) square 
root rate associated with estimation of the value (cost-to-go) function itself. 
These results have practical implications for Monte Carlo simulation-based 
solution approaches to stochastic dynamic programming problems where it 
is impractical to extract the explicit transition probabilities of the underlying 
system model.

1. Introduction


Consider a stochastic dynamic programming model (also known as a Markov decision 
process (MDP), see Arapostathis et al. 1993, Bertsekas 1995, Puterman 1994), in the 
setting where only sample paths of state transition sequences are available, e.g., when 
it is impractical to explicitly specify the transition probabilities, but the underlying 
system can be readily simulated. This is often the case when the system of interest is 
large and complex, and must therefore be modeled by a stochastic simulation model. 
One drawback of using sample path estimation is the relatively slow convergence rate 
for estimation of performance measures (e.g., the value, or cost-to-go, function), which 
is generally on the order of $O(N^{-0.5})$, where $N$ is the number of sample paths. The
focus of this paper is the problem of finding an optimal policy, and we exploit the fact 
that the policy search involves ordinal comparisons, rather than absolute estimation. 
In practice, the main idea of this approach is to compare relative orders of performance 
measures in finding the best action as quickly as possible rather than wasting effort 
in getting a more precise absolute estimate of the value function associated with each 
possible action. Under appropriate conditions, we show that the probability of selecting 
suboptimal actions is bounded by a quantity that decays to zero at an exponential rate. 

The overriding purpose of our work is to provide a rigorous theoretical foundation for 
the sample path approach in finding good policies in stochastic dynamic programming 
problems. The convergence results obtained here are completely new to this setting. 
To put our results in some perspective, we touch on the most closely related work. A 
type of exponential (geometric) convergence rate is well known in the traditional MDP 
framework (e.g., Puterman 1994), where the convergence is with respect to the horizon 
length for the value iteration procedure in infinite horizon problems with explicitly 
known transition probabilities. Our finite action setting is included in the book of 
Bertsekas and Tsitsiklis (1996), where the solution approach goes under the name 
of neuro-dynamic programming, but the focus there is on approximating the value 
function, and sample path optimal policies are not analyzed. Our results buttress
the literature on ordinal optimization (see Ho et al. 1992, 2000), which focuses on the
efficiency of ordinal comparisons rather than absolute estimation. In particular, the
exponential convergence rate for static stochastic optimization problems is established in
Dai (1996) and Dai and Chen (1997). Also, somewhat in the same spirit as our approach
is the work of Robinson (1996) and Gürkan, Özge, and Robinson (1999), who consider
sample path solution to stochastic variational inequalities, and establish conditions 
under which the sample path solution converges to the true solution; however, their 
setting is quite different from ours, in that we consider a dynamic model involving 
sequential decision making under uncertainty, and we focus on actually quantifying 
the error incurred in utilizing sample path estimates, going beyond just establishing 
convergence. 

The rest of the paper is organized as follows. Section 2 defines the problem setting. 
Section 3 establishes the theoretical results on the exponentially decaying probability 
error bounds for the basic finite horizon discounted cost problems. Section 4 briefly 
discusses some easy extensions, and the Appendix contains the detailed proof of one of
the more technical lemmas.

2. Problem Setting


In this section, we formulate the basic problem of minimizing total expected discounted
cost in a setting where the state space and action space are finite, albeit
possibly non-stationary. Let $\{X_k, k = 1, 2, \ldots\}$ denote a Markov decision process with
finite state space $S$ ($|S| > 1$), where $X_1$ is the starting state. Let $T \ge 2$ be the time
horizon, or number of periods (also known as stages), $S_k \subseteq S$ be the state space for
the $k$th period, and $A_k(x)$, $x \in S_k$, be the (finite) set of feasible actions in state $x$ and
period $k$. At stage $k$ in state $x$, the decision maker chooses an action $a \in A_k(x)$; as a
result the following occur:

(i) an immediate (deterministic) cost $c_k(x,a) \ge 0$ is accrued, and

(ii) the process moves to a state $x' \in S_{k+1}$ with transition probability $p_k(x'|x,a)$,
where $p_k(x'|x,a) \ge 0$ and $\sum_{x' \in S_{k+1}} p_k(x'|x,a) = 1$.

The objective is to find a sequence of decision rules $\{\mu_k(\cdot)\}$ comprising a policy
$\mu = \{\mu_k\}$ that minimizes the total expected discounted cost given by

\[ E\left[ \sum_{k=1}^{T} \alpha^{k-1} c_k(X_k, A_k) \right], \tag{1} \]

where $A_k$ is the action taken in period $k$, which would be $\mu_k(X_k)$ under policy $\mu$,
and $\alpha \in (0,1)$ is the (constant) discount factor. Here $X_{k+1}$ depends on both $X_k$ and
$A_k$, i.e., given $X_k = x$ and $A_k = a$, we have

\[ X_{k+1}(x,a) \sim \{p_k(\cdot|x,a)\}, \tag{2} \]

but such dependence will generally be suppressed for the sake of simplicity. Throughout,
we assume a fixed initial state $X_1 = x_1$, but this can easily be generalized to
the setting where the initial state is a random variable with an associated probability
distribution.

Define the optimal cost-to-go (or value) function from stage $k$ by

\[ J_k(x) = \min_{\mu \in U} E\left[ \sum_{i=k}^{T} \alpha^{i-k} c_i(X_i, \mu_i(X_i)) \,\Big|\, X_k = x \right], \quad \forall x \in S_k,\ k = 1, \ldots, T, \tag{3} \]

where $U$ denotes the set of all policies. The value of the MDP is given by $J_1(x_1)$,
and an optimal policy $\mu^*$, defined as any policy that minimizes (1), satisfies the
following set of equations:

\[ \mu_k^*(x) \in \arg\min_{a \in A_k(x)} \left\{ c_k(x,a) + \alpha E[J_{k+1}(X_{k+1}(x,a))] \right\}, \quad k = 1, 2, \ldots, T, \tag{4} \]

where the expectation is taken with respect to the next state $X_{k+1}$, which is a function
of the current state $x$ and action $a$, and we follow the convention that $J_{T+1}(\cdot) = 0$.
It will be convenient to introduce the Q-factors defined by the expectation on the
right-hand side (e.g., Bertsekas 1995):

\[ Q_k(x,a) = c_k(x,a) + \alpha E[J_{k+1}(X_{k+1}(x,a))], \quad k = 1, 2, \ldots, T, \tag{5} \]

representing the expected cost of taking action $a$ from state $X_k = x$ in period $k$, and
then following the optimal policy thereafter. In particular,

\[ Q_T(x,a) = c_T(x,a). \tag{6} \]

Thus, we have

\[ J_k(x) = \min_{a \in A_k(x)} Q_k(x,a). \]

For finite horizon dynamic programming with finite state and action spaces, backward induction can
be used via Equation (3) to obtain the optimal value functions $\{J_k(x), x \in S_k, k = 1, 2, \ldots, T\}$
and a corresponding optimal policy satisfying (4). For the infinite horizon
case, value iteration, policy iteration, or variants on these are used to solve the stationary
version of (3) when applicable. When the transition probabilities are explicitly
known, these procedures can sometimes be carried out in closed form or by using
straightforward numerical procedures to calculate the necessary expectations.
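As a point of reference for the known-probability case, the backward induction just described can be sketched as follows. This is a minimal illustration rather than part of the original development: the function name `backward_induction` and the dictionary-based inputs (`states`, `actions`, `cost`, `trans`) are hypothetical conventions chosen here, and the recursion follows (4)-(6) with $J_{T+1}(\cdot) = 0$.

```python
def backward_induction(T, states, actions, cost, trans, alpha):
    """Finite-horizon backward induction with known transition probabilities.

    Hypothetical input layout (an assumption of this sketch):
      states[k]        : list of states in period k, k = 1..T
      actions[(k, x)]  : list of feasible actions in state x, period k
      cost[(k, x, a)]  : immediate cost c_k(x, a)
      trans[(k, x, a)] : dict {next_state: probability}, needed only for k < T
    Returns (J, policy): J[k][x] is the optimal cost-to-go and
    policy[k][x] is one minimizing action.
    """
    J = {T + 1: {}}  # convention: J_{T+1} is identically zero
    policy = {}
    for k in range(T, 0, -1):
        J[k], policy[k] = {}, {}
        for x in states[k]:
            q = {}
            for a in actions[(k, x)]:
                future = 0.0
                if k < T:
                    # expectation of J_{k+1} over the next state
                    future = sum(p * J[k + 1][y]
                                 for y, p in trans[(k, x, a)].items())
                q[a] = cost[(k, x, a)] + alpha * future  # Q-factor as in (5)
            best = min(q, key=q.get)
            policy[k][x] = best
            J[k][x] = q[best]
    return J, policy
```

The sample path setting studied here replaces the exact expectation inside the loop with an average over simulated next states, since `trans` is assumed to be unavailable.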

In our setting, based on sample paths of the MDP sequence $X_1, X_2, \ldots$ for a given
policy $\mu$, the expectations in (1), (3), or (4) are estimated by taking sample means.
By a sample path optimal policy, we mean a policy (possibly only partially specified, if 
not all states are visited in the sample paths) that optimizes the sample mean of the 
objective function given in (1). (This is not to be confused with using a single “long” 
sample path to estimate a stationary optimal policy for infinite horizon problems.) This 
will be a function of both the sample path length and the number of sample paths. For 
the finite horizon setting, the sample path length will be equal to the number of periods 
T, whereas in the infinite horizon case, the optimal policy is approximated by a finite 
horizon sample path optimal policy. A direct implementation for using sample paths 
would be to take a “large” number of samples for each value that must be estimated, 
thus in essence reducing the problem to the traditional setting. In practice, taking a 
large number of samples may be unnecessarily wasteful, especially when the ultimate 
objective is to find the optimal policy, not necessarily to precisely estimate the optimal 
value functions for all states. The underlying philosophy is that one may obtain good 
policies through ordinal comparison even while the estimate of the value function itself 
is not that accurate. 

3. Sample Path Probability Error Bounds 

We now derive probability error bounds for the convergence of sample path optimal 
policies to a true optimal policy. We focus on searching for the optimal action in the 
first period, since optimal actions for subsequent periods can be obtained in the same 
manner. Write the feasible action set for the initial period starting in state $x_1$ as

\[ A_1(x_1) = \{a_1, a_2, \ldots, a_m\}. \]

The Q-factor of interest for the first period, as defined by (5), is

\[ Q_1(x,a) = c_1(x,a) + \alpha E[J_2(X_2(x,a))], \]

where $J_2$ is the cost-to-go function defined by (3) with horizon $T-1$. Since throughout
we are focusing on the first period with initial state $X_1 = x_1$, we will simplify notation
by dropping explicit display of the dependence on the period and initial state by
defining the unsubscripted Q-factor:

\[ Q(a) = Q_1(x_1, a). \]

Without loss of generality, we assume

\[ Q(a_1) < Q(a_2) \le \cdots \le Q(a_m), \]

i.e., $\mu_1^*(x_1) = a_1$. The case with ties for the best can also be handled in exactly the
same way; see Remark 3.2 following Theorem 3.1.

The procedure to estimate the optimal action from a given state in the setting of
this section uses sample trees. Specifically, for state $x_1$, for each action $a_l \in A_1(x_1)$,
$n$ independent sample trees are generated. Each tree begins by taking action $a_l$ in
period 1, and then sampling all possible actions in subsequent states visited. Since
the state space is finite, there may be common states visited between trees and also
within trees. We keep the tree structure by sampling from each node separately and
independently according to (2), so there will be no “recombining” branches, even if the
same state were reached at different nodes of the tree. To be more specific, a sample
tree is generated as follows for initial period action $a_l$.

(i) In period 1, generate one next (period 2) state sample (node) according to
$p_1(\cdot|x_1, a_l)$.

(ii) In period 2, generate a next (period 3) state sample (node) according to $p_2(\cdot|x,a)$,
for each feasible action $a \in A_2(x)$, where $x$ is the state generated in step (i).

(iii) Starting from each state $x$ visited in period $k$ of the tree ($k = 3, \ldots, T-1$),
generate a next (period $k+1$) state sample (node) according to $p_k(\cdot|x,a)$ for each
feasible action $a \in A_k(x)$.
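Steps (i)-(iii) can be sketched in code as below. This is an illustrative sketch only: the `Node` class, the function names, and the callables `actions(k, x)` (feasible actions) and `sample_next(k, x, a, rng)` (one draw from the period-$k$ transition law, i.e., the simulator) are hypothetical names introduced here, not notation from the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    period: int
    state: str
    # action -> child Node; each child is a separate, independently sampled node
    children: dict = field(default_factory=dict)


def grow(node, T, actions, sample_next, rng):
    """Steps (ii)-(iii): from a node in period k < T, sample one next state
    per feasible action, then recurse. Nodes in period T are leaves."""
    if node.period >= T:
        return
    for a in actions(node.period, node.state):
        child = Node(node.period + 1,
                     sample_next(node.period, node.state, a, rng))
        node.children[a] = child
        grow(child, T, actions, sample_next, rng)


def sample_tree(x1, first_action, T, actions, sample_next, rng):
    """Step (i): take the given first-period action once, then branch on
    every feasible action at each later node (no recombining branches)."""
    root = Node(1, x1)
    child = Node(2, sample_next(1, x1, first_action, rng))
    root.children[first_action] = child
    grow(child, T, actions, sample_next, rng)
    return root
```

Generating $n$ such trees for each first-period action amounts to calling `sample_tree` $n$ times with independent draws; nodes are never merged, even when two of them carry the same state.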

As mentioned earlier, all sampling is done independently of other trees and other nodes
in the same tree; however, correlation between sampling of different actions from the
same node in a tree is allowed. Let $S_k^{(l)} \subseteq S_k$, $k = 2, \ldots, T$, denote the set of states
actually visited in period $k$ over all $n$ sample trees initiated with action $a_l$. An example
for $n = 3$ is shown in Figure 1. In this example, even if $x_5 = x_6$, i.e., the state reached is
the same, the nodes themselves remain distinct, in that separate independent samples
would be generated from each for each possible action in $A_3(x_5) = A_3(x_6)$.

Sample path estimates for the Q-factors and cost-to-go functions are obtained via
backward induction as follows:

\[ \hat{Q}_T^{(l)}(x,a) = c_T(x,a), \quad x \in S_T^{(l)}, \tag{7} \]

\[ \hat{\mu}_k^{(l)}(x) \in \arg\min_{a \in A_k(x)} \hat{Q}_k^{(l)}(x,a), \quad x \in S_k^{(l)},\ k = 2, \ldots, T, \tag{8} \]

\[ \hat{J}_k^{(l)}(x) = \min_{a \in A_k(x)} \hat{Q}_k^{(l)}(x,a) = \hat{Q}_k^{(l)}(x, \hat{\mu}_k^{(l)}(x)), \quad x \in S_k^{(l)},\ k = 2, \ldots, T, \tag{9} \]

\[ \hat{Q}_k^{(l)}(x,a) = c_k(x,a) + |\mathcal{N}_{k+1}^{(l)}(x,a)|^{-1} \alpha \sum_{y \in \mathcal{N}_{k+1}^{(l)}(x,a)} \hat{J}_{k+1}^{(l)}(y), \quad x \in S_k^{(l)},\ k = 2, \ldots, T-1, \tag{10} \]

where $\mathcal{N}_{k+1}^{(l)}(x,a)$ is the multi-set (i.e., includes states repeated if sampled more than
once) of states reached in period $k+1$ from state $x$ with action $a$ in period $k$
($k = 1, \ldots, T-1$).

[Figure 1: Example of simulated trees for $n = 3$, with $A_2(x_2) = \{a'_1, a'_2\}$, $A_2(x_3) = \{a'_3, a'_4, a'_5\}$, $A_2(x_4) = \{a'_6, a'_7\}$, $S_2^{(l)} = \{x_2, x_3, x_4\}$, and $S_3^{(l)} = \{x_5, x_6, \ldots, x_{11}\}$.]

Similar to the unsubscripted initial-period, initial-state Q-factors defined earlier,
we define the following corresponding tree-based estimator:

\[ \hat{Q}(a_l) = c_1(x_1, a_l) + \frac{1}{n} \alpha \sum_{y \in \mathcal{N}(a_l)} \hat{J}_2^{(l)}(y), \]

where we have defined $\mathcal{N}(a_l) = \mathcal{N}_2^{(l)}(x_1, a_l)$ and $|\mathcal{N}(a_l)| = n$. We then estimate the
optimal first-period action in the natural way:

\[ \hat{a}_1(n) = \arg\min_{a_l \in A_1(x_1)} \{\hat{Q}(a_l)\}. \tag{11} \]

Averaging over $\mathcal{N}_{k+1}^{(l)}(x,a)$ in (10) is needed to ensure consistency in defining decision
rules via (8), since the same state can be reached more than once in sampling, on the
same tree or on different trees. If all $n$ trees for a given $a_l$ are distinct with no common
states in any period beyond the initial state, so $|\mathcal{N}_{k+1}^{(l)}(x,a)| = 1$ for $k > 1$, then the
DP algorithm simply corresponds to performing (deterministic) backward induction
individually on each tree.

Figure 2 shows a simple example, which we use to illustrate how Equations (7)-(11)
are applied and why the averaging is necessary. There are two trees ($n = 2$), and both
reach the same state $x_2$ in period 2, hence the multi-set $\mathcal{N}(a_l) = \{x_2, x_2\}$. Assume
for simplicity that the discount factor is one ($\alpha = 1$). Applying the DP algorithm
separately to each tree, we obtain (suppressing the superscript $(l)$ for notational
convenience) $\hat{J}_3(x_3) = 1$, $\hat{J}_3(x_4) = 3$, $\hat{J}_3(x_5) = 1$; for the upper tree, $\hat{J}_2(x_2) = 2$ and
$\hat{\mu}_2(x_2) = a'_1$, whereas for the lower tree, $\hat{J}_2(x_2) = 3$ and $\hat{\mu}_2(x_2) = a'_2$, leading to a
conflict in specifying the decision rule (action for state $x_2$). On the other hand, with
the averaging (over just the two trees, i.e., $n = 2$), $\hat{Q}_2(x_2, a'_1) = 1 + (1 + 3)/2 = 3$ and
$\hat{Q}_2(x_2, a'_2) = 2 + (3 + 1)/2 = 4$, which gives $\hat{J}_2(x_2) = \min\{\hat{Q}_2(x_2, a'_1), \hat{Q}_2(x_2, a'_2)\} = 3$
and $\hat{\mu}_2(x_2) = \arg\min_{a'_i} \{\hat{Q}_2(x_2, a'_i)\} = a'_1$; hence $\hat{Q}(a_l) = c_1(x_1, a_l) + 3$. This would
be repeated for all other actions in $A_1(x_1)$, and then (one of) the action(s) with the
lowest value of $\hat{Q}(\cdot)$ would be selected to be the estimated optimal action in state $x_1$.

[Figure 2: Example of two simulated trees with common states (costs shown on branches), with
$\mathcal{N}_2^{(l)}(x_1, a_l) = \{x_2, x_2\}$, $\mathcal{N}_3^{(l)}(x_2, a'_1) = \{x_3, x_4\}$, $\mathcal{N}_3^{(l)}(x_2, a'_2) = \{x_4, x_5\}$,
$S_2^{(l)} = \{x_2\}$, and $S_3^{(l)} = \{x_3, x_4, x_5\}$.
Dynamic programming performed separately on each tree would lead to a different action from
state $x_2$ in period 2 on the two trees ($a'_1$ in the upper tree, $a'_2$ in the lower tree). Averaging
appropriately over the corresponding nodes in the two trees leads to $\hat{\mu}_2(x_2) = a'_1$.]
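The arithmetic of the two-tree example can be reproduced with a few lines of code. The sketch below is illustrative: the helper names are hypothetical, the period-2 costs (1 for $a'_1$, 2 for $a'_2$) are read off the Q-factor expressions in the text, and the assignment of sampled successors to the upper and lower trees is inferred from the per-tree values quoted there.

```python
def q_estimates(costs, successors, j_next, alpha):
    """Estimate in the style of (10): immediate cost plus the discounted
    average of next-stage values over the sampled successor multi-set."""
    q = {}
    for a, c in costs.items():
        sampled = successors[a]
        q[a] = c + alpha * sum(j_next[y] for y in sampled) / len(sampled)
    return q


def best_action(q):
    """Action with the smallest estimated Q-factor, as in (8)."""
    return min(q, key=q.get)


# Period-3 value estimates and period-2 costs from the example (alpha = 1).
J3 = {"x3": 1.0, "x4": 3.0, "x5": 1.0}
C2 = {"a1'": 1.0, "a2'": 2.0}

# Successors of x2 sampled on each tree (assumed assignment, see lead-in).
UPPER = {"a1'": ["x3"], "a2'": ["x4"]}
LOWER = {"a1'": ["x4"], "a2'": ["x5"]}
# Pooling the two nodes for x2 gives the multi-sets used in the averaging.
POOLED = {a: UPPER[a] + LOWER[a] for a in C2}

q_upper = q_estimates(C2, UPPER, J3, alpha=1.0)
q_lower = q_estimates(C2, LOWER, J3, alpha=1.0)
q_pooled = q_estimates(C2, POOLED, J3, alpha=1.0)

print(best_action(q_upper), best_action(q_lower), best_action(q_pooled))
```

Treating the trees separately yields conflicting actions for $x_2$, whereas the pooled estimate gives a single decision rule, which is the consistency property the averaging is meant to provide.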

Our results use the large deviations principle (cf. Dembo and Zeitouni 1998), which 
yields exponentially decaying probability error bounds under appropriate conditions. 

Lemma 3.1: Consider a sequence of i.i.d. random variables $\{Y_n, n \ge 1\}$ with moment
generating function $M(\lambda) = E[\exp(\lambda Y_1)]$. Let $S_n = \sum_{i=1}^{n} Y_i$. If $M(\lambda)$ exists in a
neighborhood $(-\epsilon, \epsilon)$ of $\lambda = 0$ for some $\epsilon > 0$, then

\[ P(S_n/n \ge x) \le \exp(-n\Lambda_+^*(x)), \quad \forall x, \]

and

\[ P(S_n/n \le x) \le \exp(-n\Lambda_-^*(x)), \quad \forall x, \]

where

\[ \Lambda_+^*(x) = \sup_{0 \le \lambda < \epsilon} (\lambda x - \log M(\lambda)) \]

and

\[ \Lambda_-^*(x) = \sup_{-\epsilon < \lambda \le 0} (\lambda x - \log M(\lambda)). \]

Furthermore, if $|Y_i| \le M$ for some constant $M < \infty$, then $\Lambda_+^*(x) > 0$ for $x > E[Y_1]$,
and $\Lambda_-^*(x) > 0$ for $x < E[Y_1]$.

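Proof. The first part follows directly from Xie (1997). For the second part, we show
only the $x > E[Y_1]$ case, since the $x < E[Y_1]$ case is similar.

Using a Taylor series expansion around $0$, for $\lambda > 0$ there exists $\xi \in [0, \lambda)$ such that

\[ \Lambda(\lambda) = \log E[\exp(\lambda Y_1)] = \Lambda(0) + \Lambda'(0)\lambda + \frac{1}{2}\Lambda''(\xi)\lambda^2 = \lambda E[Y_1] + \frac{1}{2}\Lambda''(\xi)\lambda^2, \]

the last equality following from $\Lambda(0) = 0$ and $\Lambda'(0) = E[Y_1]$.

We now turn to evaluating $\Lambda''(\xi)$. Since $|Y_1| \le M$,

\[ \Lambda''(\xi) = \frac{E[Y_1^2 \exp(\xi Y_1)]\, E[\exp(\xi Y_1)] - (E[Y_1 \exp(\xi Y_1)])^2}{(E[\exp(\xi Y_1)])^2} \le \frac{E[Y_1^2 \exp(\xi Y_1)]}{E[\exp(\xi Y_1)]} \le M^2. \]

Consequently, for $x > E[Y_1]$,

\[ \Lambda_+^*(x) = \sup_{\lambda \ge 0} \{\lambda x - \log E[\exp(\lambda Y_1)]\} \ge \sup_{\lambda \ge 0} \left\{ \lambda (x - E[Y_1]) - \frac{M^2 \lambda^2}{2} \right\} > 0, \]

completing the proof. □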
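The quadratic lower bound in the last display has the closed form $\sup_{\lambda \ge 0}\{\lambda d - M^2\lambda^2/2\} = d^2/(2M^2)$ with $d = x - E[Y_1]$, attained at $\lambda = d/M^2$. The following sketch checks this numerically and compares it with the exact rate for one bounded example. The choice of a symmetric two-point variable $Y = \pm M$ (for which $\log E[\exp(\lambda Y)] = \log\cosh(\lambda M)$) is purely an illustrative assumption, as are the function names and the grid search.

```python
import math


def quadratic_rate(d, M):
    """Closed form of sup over lam >= 0 of (lam*d - M^2 * lam^2 / 2), d > 0."""
    return d ** 2 / (2 * M ** 2)


def numeric_sup(f, lam_max, steps=20000):
    """Crude grid search for the supremum of f over [0, lam_max]."""
    return max(f(lam_max * i / steps) for i in range(steps + 1))


def two_point_rate(x, M, lam_max=50.0):
    """Rate sup(lam*x - log M(lam)) for Y = +M or -M with probability 1/2
    (mean zero), whose log moment generating function is log cosh(lam*M)."""
    return numeric_sup(lambda lam: lam * x - math.log(math.cosh(lam * M)),
                       lam_max)


d, M = 0.5, 1.0
grid_value = numeric_sup(lambda lam: lam * d - (M * lam) ** 2 / 2, 50.0)
print(quadratic_rate(d, M), grid_value, two_point_rate(d, M))
```

In this example the exact rate is slightly larger than the quadratic bound, consistent with the inequality in the proof: the bound gives up some tightness in exchange for depending on the distribution only through $M$.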

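Remark 3.1: For a finite-horizon MDP with finite action and state spaces, the total
discounted cost $\sum_{k=1}^{T} \alpha^{k-1} c_k(X_k, \mu_k(X_k))$ has finite moment generating function on
$(-\infty, \infty)$ for any policy $\mu \in U$. Define

\[ C_0 = \max_{k \in \{1, 2, \ldots, T\}} \ \max_{x \in S_k,\, a \in A_k(x)} c_k(x,a), \]

and

\[ J_0 = \sum_{k=1}^{T} \alpha^{k-1} C_0. \]

From the backward induction DP algorithm, it is easy to show that for any $l$, $k$, and $x$,

\[ \hat{J}_k^{(l)}(x) \le J_0, \]

and, from the definition of $J_k(x)$, it is easy to see

\[ J_k(x) \le J_0. \tag{12} \]

Set

\[ |A| = \max_{k \in \{1, 2, \ldots, T\}} \ \max_{x \in S_k} |A_k(x)|. \]

Lemma 3.2: If $\gamma \in (0,1)$ and $\delta > 0$ satisfy

\[ 2\Delta_{\gamma,\delta} < Q(a_2) - Q(a_1), \tag{13} \]

where

\[ \Delta_{\gamma,\delta} = \left( \sum_{i=1}^{T-1} \alpha^{i-1} \right) \delta + \left( \sum_{i=1}^{T-2} \alpha^{i-1} \right) |S| \gamma J_0, \tag{14} \]

then

\[ P\left( \bigcup_{l=1}^{m} \{\hat{Q}(a_l) \ge Q(a_l) + \Delta_{\gamma,\delta}\} \right) \le |A| \left( \sum_{i=0}^{T-2} |S|^i \exp(-n\gamma^i\delta') \right), \tag{15} \]

\[ P\left( \bigcup_{l=1}^{m} \{\hat{Q}(a_l) \le Q(a_l) - \Delta_{\gamma,\delta}\} \right) \le |A| \left( \sum_{i=0}^{T-2} |S|^i \exp(-n\gamma^i\delta') \right), \tag{16} \]

where

\[ \delta' = \sup_{\lambda \ge 0} \left\{ \lambda\delta - \frac{(\alpha J_0)^2 \lambda^2}{2} \right\} > 0. \tag{17} \]

Proof. See the Appendix.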


We are now in a position to present and prove the main result of this section. In
words, the theorem states that the sample path first-period optimal action(s) contained
in the set $\hat{a}_1(n)$ converges in probability to the true optimal action $a_1$ for the finite
horizon problem at an exponentially decaying rate with respect to the number of sample
paths (trees).

Theorem 3.1:

\[ P(\hat{a}_1(n) \ne \{a_1\}) \le 2|A| \left( \sum_{i=0}^{T-2} |S|^i \exp(-n\gamma^i\delta') \right), \]

where $\gamma$ and $\delta$ satisfy the conditions of Lemma 3.2 and $\delta'$ is given by (17).

Remark 3.2: If $Q(a_1) = Q(a_2) = \cdots = Q(a_k) < Q(a_{k+1}) \le \cdots \le Q(a_m)$, then the
left-hand side just becomes $P(\hat{a}_1(n) \not\subseteq \{a_1, \ldots, a_k\})$.

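Proof. Suppose that $\hat{a}_1(n) \ne \{a_1\}$. Then, $\exists\, l \ne 1$ such that $\hat{Q}(a_l) \le \hat{Q}(a_1)$, i.e.,

\[
\begin{aligned}
P(\hat{a}_1(n) \ne \{a_1\}) &= P\left( \bigcup_{l \ne 1} \{\hat{Q}(a_l) \le \hat{Q}(a_1)\} \right) \\
&\le P\left( \bigcup_{l \ne 1} \{\hat{Q}(a_l) \le \hat{Q}(a_1)\},\ \hat{Q}(a_1) < Q(a_1) + \Delta_{\gamma,\delta} \right) + P\left( \hat{Q}(a_1) \ge Q(a_1) + \Delta_{\gamma,\delta} \right).
\end{aligned}
\]

Since $Q(a_2) \le Q(a_l)$ for any $a_l$ ($\ne a_1$), condition (13) gives

\[ Q(a_1) < Q(a_2) - 2\Delta_{\gamma,\delta} \le Q(a_l) - 2\Delta_{\gamma,\delta}, \]

or

\[ Q(a_1) + \Delta_{\gamma,\delta} < Q(a_l) - \Delta_{\gamma,\delta}, \]

so we have

\[
\begin{aligned}
P(\hat{a}_1(n) \ne \{a_1\}) &\le P\left( \bigcup_{l \ne 1} \{\hat{Q}(a_l) \le Q(a_l) - \Delta_{\gamma,\delta}\} \right) + |A| \left( \sum_{i=0}^{T-2} |S|^i \exp(-n\gamma^i\delta') \right) \\
&\le 2|A| \left( \sum_{i=0}^{T-2} |S|^i \exp(-n\gamma^i\delta') \right),
\end{aligned}
\]

where Lemma 3.2 has been applied twice. □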
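To get a feel for how the bound of Theorem 3.1 behaves as the number of trees $n$ grows, it can be evaluated numerically. The sketch below is illustrative only. It assumes the closed form $\delta' = \delta^2 / (2(\alpha J_0)^2)$, i.e., the value of $\sup_{\lambda \ge 0}\{\lambda\delta - (\alpha J_0)^2\lambda^2/2\}$, and it assumes that $\Delta_{\gamma,\delta}$ is the sum $(\sum_{i=1}^{T-1}\alpha^{i-1})\delta + (\sum_{i=1}^{T-2}\alpha^{i-1})|S|\gamma J_0$; the function names and the parameter values in the usage lines are hypothetical.

```python
import math


def delta_gamma(T, alpha, delta, gamma, n_states, J0):
    """Margin Delta_{gamma,delta} under the form assumed in the lead-in."""
    first = sum(alpha ** (i - 1) for i in range(1, T))       # i = 1..T-1
    second = sum(alpha ** (i - 1) for i in range(1, T - 1))  # i = 1..T-2
    return first * delta + second * n_states * gamma * J0


def delta_prime(delta, alpha, J0):
    """Assumed closed form of sup_{lam>=0} (lam*delta - (alpha*J0)^2 lam^2/2)."""
    return delta ** 2 / (2 * (alpha * J0) ** 2)


def theorem_bound(n, T, n_actions, n_states, gamma, delta, alpha, J0):
    """Right-hand side of the Theorem 3.1 bound; it can exceed 1 for small n,
    in which case it carries no information."""
    dp = delta_prime(delta, alpha, J0)
    total = sum(n_states ** i * math.exp(-n * gamma ** i * dp)
                for i in range(T - 1))                       # i = 0..T-2
    return 2 * n_actions * total


# Hypothetical parameter values, chosen only to show the decay in n.
for n in (20, 40, 80):
    print(n, theorem_bound(n, T=3, n_actions=2, n_states=2,
                           gamma=0.5, delta=1.0, alpha=0.5, J0=2.0))
```

The bound is only valid when $\gamma$ and $\delta$ satisfy (13), so in practice `delta_gamma` would be compared against half the gap $Q(a_2) - Q(a_1)$, which is unknown in the simulation setting. The slowest-decaying term, with exponent $n\gamma^{T-2}\delta'$, dominates for large $n$.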


4. Extensions 

The results can be extended to the following cases with essentially the same framework:

• random costs;

• stochastic and non-stationary discount factor, by replacing $\alpha^k$ throughout by
$\prod_{j=1}^{k} \alpha_j$, where $\alpha_j$ is the discount rate for period $j$.

Convergence of the same algorithm for infinite state spaces is not a problem, but 
the current method of proof for the convergence rate result will not carry through. 
Extension to infinite action spaces is also not straightforward, as the current algorithm 
is not even applicable. These extensions are topics of ongoing research. 


Appendix A. Proof of Lemma 3.2 


We show (15) only, as the proof for (16) proceeds analogously. We first
establish three preliminary results.

Lemma A1: Let $Z_i \sim p_k(\cdot|x,a)$ i.i.d. for fixed $x \in S_k$, $a \in A_k(x)$. For any $N > 0$ and
$\delta > 0$,

\[ P\left( \frac{\alpha}{N} \sum_{i=1}^{N} J_{k+1}(Z_i) \ge \alpha E[J_{k+1}(Z_i)] + \delta \right) \le \exp(-N\delta'), \]

$k = 1, \ldots, T-1$, where $\delta'$ is given by (17).

Proof. The proof follows directly from Lemma 3.1, with $Y_i = \alpha(J_{k+1}(Z_i) - E[J_{k+1}(Z_i)])$,
so $E[Y_i] = 0$ and $Y_i$ has finite moment generating function (cf. Remark
3.1). Applying the first part of Lemma 3.1 leads to

\[ P\left( \frac{1}{N} \sum_{i=1}^{N} \alpha J_{k+1}(Z_i) \ge \alpha E[J_{k+1}(Z_i)] + \delta \right) \le \exp(-N\Lambda_+^*(\delta)), \]

where

\[ \Lambda_+^*(\delta) = \sup_{\lambda \ge 0} (\lambda\delta - \log E[\exp(\lambda Y_i)]). \]

Since $|Y_i| \le \alpha J_0 = M$, the second part of Lemma 3.1 can be applied:

\[ \sup_{\lambda \ge 0} (\lambda\delta - \log E[\exp(\lambda Y_i)]) \ge \sup_{\lambda \ge 0} \left\{ \lambda\delta - \frac{(\alpha J_0)^2 \lambda^2}{2} \right\} = \delta' > 0, \]

with $\delta'$ derived using (12). □

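Lemma A1': Under the same conditions as Lemma A1, let $\mathcal{N}$ be a non-negative
integer-valued random variable independent of $\{Z_i\}$. Then,

\[ P\left( \frac{\alpha}{\mathcal{N}} \sum_{i=1}^{\mathcal{N}} J_{k+1}(Z_i) \ge \alpha E[J_{k+1}(Z_i)] + \delta,\ \mathcal{N} \ge N \right) \le \exp(-N\delta'), \]

$k = 1, \ldots, T-1$, where $\delta'$ is given by (17).

Proof. Using Lemma A1, note that for any $N_0 \ge N$ the conditional probability satisfies

\[
\begin{aligned}
P\left( \frac{\alpha}{\mathcal{N}} \sum_{i=1}^{\mathcal{N}} J_{k+1}(Z_i) \ge \alpha E[J_{k+1}(Z_i)] + \delta \,\Big|\, \mathcal{N} = N_0 \right)
&= P\left( \frac{\alpha}{N_0} \sum_{i=1}^{N_0} J_{k+1}(Z_i) \ge \alpha E[J_{k+1}(Z_i)] + \delta \,\Big|\, \mathcal{N} = N_0 \right) \\
&\le \exp(-N_0\delta') \le \exp(-N\delta').
\end{aligned}
\]

Unconditioning yields the desired result. □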

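Note that $|\mathcal{N}_k^{(l)}(x,a)|$ is constant over $a \in A_{k-1}(x)$, i.e., $|\mathcal{N}_k^{(l)}(x,a)| = |\mathcal{N}_k^{(l)}(x,a')|$ for
all $a' \in A_{k-1}(x)$, so we simplify notation by dropping the dependence on the action in
writing $|\mathcal{N}_k^{(l)}(x)|$ for $|\mathcal{N}_k^{(l)}(x,a)|$.

Lemma A2: For $x \in S_{T-1}$,

\[ P\left( \hat{J}_{T-1}^{(l)}(x) \ge J_{T-1}(x) + \delta,\ |\mathcal{N}_T^{(l)}(x)| \ge N \right) \le |A| \exp(-N\delta'). \tag{18} \]

Proof. For $x \in S_T$,

\[ \hat{J}_T^{(l)}(x) = \min_{a \in A_T(x)} \hat{Q}_T^{(l)}(x,a) = \min_{a \in A_T(x)} c_T(x,a) = J_T(x), \]

so

\[ \hat{J}_{T-1}^{(l)}(x) = \min_{a \in A_{T-1}(x)} \left\{ c_{T-1}(x,a) + |\mathcal{N}_T^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_T^{(l)}(x,a)} J_T(y) \right\}. \]

Note that

\[
\begin{aligned}
\{\hat{J}_{T-1}^{(l)}(x) \ge J_{T-1}(x) + \delta\}
&= \left\{ \min_{a \in A_{T-1}(x)} \left\{ c_{T-1}(x,a) + |\mathcal{N}_T^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_T^{(l)}(x,a)} J_T(y) \right\} \ge \min_{a \in A_{T-1}(x)} \left\{ c_{T-1}(x,a) + \alpha E[J_T(X_T(x,a))] + \delta \right\} \right\} \\
&\subseteq \bigcup_{a \in A_{T-1}(x)} \left\{ |\mathcal{N}_T^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_T^{(l)}(x,a)} J_T(y) \ge \alpha E[J_T(X_T(x,a))] + \delta \right\}.
\end{aligned}
\tag{19}
\]

Thus,

\[
\begin{aligned}
P\left( \hat{J}_{T-1}^{(l)}(x) \ge J_{T-1}(x) + \delta,\ |\mathcal{N}_T^{(l)}(x)| \ge N \right)
&\le P\left( \bigcup_{a \in A_{T-1}(x)} \left\{ |\mathcal{N}_T^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_T^{(l)}(x,a)} J_T(y) \ge \alpha E[J_T(X_T(x,a))] + \delta,\ |\mathcal{N}_T^{(l)}(x)| \ge N \right\} \right) \\
&\le |A| \exp(-N\delta'),
\end{aligned}
\]

the last inequality following from Lemma A1', proving (18). □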

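Lemma A3: For $k \in \{2, 3, \ldots, T-1\}$, $x \in S_k$, and $a \in A_k(x)$,

\[ P\left( \hat{J}_k^{(l)}(x) \ge J_k(x) + C_k,\ |\mathcal{N}_{k+1}^{(l)}(x)| \ge n\gamma^{k-1} \right) \le D_k, \tag{20} \]

where

\[ C_k = \left( \sum_{i=1}^{T-k} \alpha^{i-1} \right) \delta + \left( \sum_{i=1}^{T-k-1} \alpha^{i-1} \right) |S| \gamma J_0 \]

and

\[ D_k = |A| \left( \sum_{i=0}^{T-k-1} |S|^i \exp(-n\gamma^{k-1+i}\delta') \right). \]

In particular, $C_1 = \Delta_{\gamma,\delta}$ and $C_{T-1} = \delta$.

Proof. We establish the result via backward induction. By (18) in Lemma A2, (20)
holds when $k = T-1$. Assuming that (20) is true when $k = t$, $t \in \{3, \ldots, T-1\}$, we
want to show that it holds when $k = t-1$.

Recall that $N_k^{(l)}(y)$ denotes the number of times state $y$ is reached in period $k$ over
all $n$ sampled trees initiated by $a_l$, and define the set

\[ R_k^{(l)} = R_k^{(l)}(\gamma) = \{ y \in S_k^{(l)} : N_k^{(l)}(y) \ge n\gamma^{k-1} \}, \]

where explicit dependence on $\gamma$ is omitted for notational simplification, since it is fixed.
If $y \in R_k^{(l)}$, then state $y$ was visited at least $n\gamma^{k-1}$ times in period $k$.

From the definition of $\hat{J}_t^{(l)}$ given by (9) and (10), we have the following decomposition
for $x \in S_{t-1}$:

\[
\begin{aligned}
\hat{J}_{t-1}^{(l)}(x) &= \min_{a \in A_{t-1}(x)} \left\{ c_{t-1}(x,a) + |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a)} \hat{J}_t^{(l)}(y) \right\} \\
&= \min_{a \in A_{t-1}(x)} \left\{ c_{t-1}(x,a) + |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap \bar{R}_t^{(l)}} \hat{J}_t^{(l)}(y) + |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}} \hat{J}_t^{(l)}(y) \right\},
\end{aligned}
\tag{21}
\]

where the set complement is denoted using the overbar, and the intersection of a
multi-set and an ordinary set is assumed to be given by a corresponding multi-set. For
example, $\{1,1,1,2,3\} \cap \{1,3,5\} = \{1,1,1,3\}$. We now find bounds for each of the last
two terms in the decomposition given by (21).

By definition of $R_k^{(l)}$, we have the following bound:

\[ |\mathcal{N}_k^{(l)}(x,a) \cap \bar{R}_k^{(l)}| \le \sum_{y \in S} N_k^{(l)}(y)\, 1\{N_k^{(l)}(y) < n\gamma^{k-1}\} \le \sum_{y \in S} n\gamma^{k-1} = |S| n \gamma^{k-1}. \]

Thus, for $x \in S_{t-1}$ such that $|\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2}$, we have

\[
\begin{aligned}
|\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap \bar{R}_t^{(l)}} \hat{J}_t^{(l)}(y)
&\le |\mathcal{N}_t^{(l)}(x)|^{-1}\, |\mathcal{N}_t^{(l)}(x,a) \cap \bar{R}_t^{(l)}|\, \alpha J_0 \quad (\text{since } \hat{J}_t^{(l)}(\cdot) \le J_0) \\
&\le \alpha J_0 |S| n\gamma^{t-1} / (n\gamma^{t-2}) \le |S| \gamma J_0.
\end{aligned}
\tag{22}
\]

Note that for $a \in A_{t-1}(x)$, $y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}$, we have $|\mathcal{N}_{t+1}^{(l)}(y)| \ge n\gamma^{t-1}$, and by
the induction assumption, (20) holds when $k = t$, so

\[
\begin{aligned}
P\left( \hat{J}_t^{(l)}(y) \ge J_t(y) + C_t,\ y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)} \right)
&\le P\left( \hat{J}_t^{(l)}(y) \ge J_t(y) + C_t,\ |\mathcal{N}_{t+1}^{(l)}(y)| \ge n\gamma^{t-1} \right) \le D_t,
\end{aligned}
\]

implying that

\[
\begin{aligned}
P\left( \bigcup_{a \in A_{t-1}(x)} \ \bigcup_{y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}} \{\hat{J}_t^{(l)}(y) \ge J_t(y) + C_t\} \right)
&\le P\left( \bigcup_{i=1}^{|S|} \left\{ \hat{J}_t^{(l)}(s_i) \ge J_t(s_i) + C_t,\ |\mathcal{N}_{t+1}^{(l)}(s_i)| \ge n\gamma^{t-1} \right\} \right) \le |S| D_t,
\end{aligned}
\tag{23}
\]

where we have enumerated all possible states as $S = \{s_1, \ldots, s_{|S|}\}$.

Hence, similar to the proof of Lemma A2, by combining (21), (22) and (23), we have

\[
\begin{aligned}
&P\left( \hat{J}_{t-1}^{(l)}(x) \ge J_{t-1}(x) + C_{t-1},\ |\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2} \right) \\
&\le P\left( \bigcup_{a \in A_{t-1}(x)} \left\{ |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a)} \hat{J}_t^{(l)}(y) \ge \alpha E[J_t(X_t(x,a))] + C_{t-1} \right\},\ |\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2} \right) \\
&= P\left( \bigcup_{a \in A_{t-1}(x)} \left\{ |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \left[ \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap \bar{R}_t^{(l)}} \hat{J}_t^{(l)}(y) + \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}} \hat{J}_t^{(l)}(y) \right] \ge \alpha E[J_t(X_t(x,a))] + C_{t-1} \right\},\ |\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2} \right) \quad \text{by (21)} \\
&\le P\left( \bigcup_{a \in A_{t-1}(x)} \left\{ |S|\gamma J_0 + |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}} \hat{J}_t^{(l)}(y) \ge \alpha E[J_t(X_t(x,a))] + C_{t-1} \right\},\ |\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2} \right) \quad \text{by (22)} \\
&\le P\left( \bigcup_{a \in A_{t-1}(x)} \left\{ |S|\gamma J_0 + |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}} (J_t(y) + C_t) \ge \alpha E[J_t(X_t(x,a))] + C_{t-1} \right\},\ |\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2} \right) + |S| D_t \quad \text{by (23)} \\
&= P\left( \bigcup_{a \in A_{t-1}(x)} \left\{ |S|\gamma J_0 + \frac{|\mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}|}{|\mathcal{N}_t^{(l)}(x)|} \alpha C_t + |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a) \cap R_t^{(l)}} J_t(y) \ge \alpha E[J_t(X_t(x,a))] + C_{t-1} \right\},\ |\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2} \right) + |S| D_t \\
&\le P\left( \bigcup_{a \in A_{t-1}(x)} \left\{ |\mathcal{N}_t^{(l)}(x)|^{-1} \alpha \sum_{y \in \mathcal{N}_t^{(l)}(x,a)} J_t(y) \ge \alpha E[J_t(X_t(x,a))] + \delta \right\},\ |\mathcal{N}_t^{(l)}(x)| \ge n\gamma^{t-2} \right) + |S| D_t \quad (\text{since } \delta = C_{t-1} - \alpha C_t - |S|\gamma J_0) \\
&\le |A| \exp(-n\gamma^{t-2}\delta') + |S| D_t = D_{t-1} \quad \text{by Lemma A1'},
\end{aligned}
\tag{24}
\]

completing the induction. □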


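Similar to the proofs of Lemmas A2 and A3, we finish the proof of Lemma 3.2 by
establishing (15), recalling that $\mathcal{N}(a_l) = \mathcal{N}_2^{(l)}(x_1, a_l)$ and $|\mathcal{N}(a_l)| = n$:

\[
\begin{aligned}
&P\left( \bigcup_{l=1}^{m} \{\hat{Q}(a_l) \ge Q(a_l) + \Delta_{\gamma,\delta}\} \right) \\
&= P\left( \bigcup_{l=1}^{m} \left\{ \frac{1}{n} \alpha \sum_{y \in \mathcal{N}(a_l)} \hat{J}_2^{(l)}(y) \ge \alpha E[J_2(X_2(x_1, a_l))] + C_1 \right\} \right) \\
&\le P\left( \bigcup_{l=1}^{m} \left\{ |S|\gamma J_0 + \alpha C_2 + \frac{1}{n} \alpha \sum_{y \in \mathcal{N}(a_l)} J_2(y) \ge \alpha E[J_2(X_2(x_1, a_l))] + C_1 \right\} \right) + |S| D_2 \quad \text{by (21), (22), (23)} \\
&= P\left( \bigcup_{l=1}^{m} \left\{ \frac{1}{n} \alpha \sum_{y \in \mathcal{N}(a_l)} J_2(y) \ge \alpha E[J_2(X_2(x_1, a_l))] + \delta \right\} \right) + |S| D_2 \\
&\le |A| \exp(-n\delta') + |S| D_2 = D_1 = |A| \left( \sum_{i=0}^{T-2} |S|^i \exp(-n\gamma^i\delta') \right), \quad \text{using Lemma A1}.
\end{aligned}
\]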


Acknowledgements 

This research was supported in part by the National Science Foundation under 
Grants DMI-9713720 and DMI-9988867, and by the Air Force Office of Scientific 
Research under Grant F496200110161. Xing Jin also acknowledges the support of 
the National University of Singapore under Grant R-146-000-045-101.





References 


[1] Arapostathis, A., V.S. Borkar, E. Fernández-Gaucherand, M.K. Ghosh, and S.I. Marcus,
“Discrete-Time Controlled Markov Processes with Average Cost Criterion: A Survey,” SIAM
Journal on Control and Optimization, 31, 282-344, 1993.

[2] Bertsekas, D.P., Dynamic Programming and Optimal Control, Vol. 1 & 2, Athena Scientific, 
1995. 

[3] Bertsekas, D.P., and J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

[4] Dai, L., “Convergence Properties of Ordinal Comparison in the Simulation of Discrete Event 
Dynamic Systems,” Journal of Optimization Theory and Applications, 91, 363-388, 1996. 

[5] Dai, L., and C. Chen, “Rate of Convergence for Ordinal Comparison of Dependent Simulations 
in Discrete Event Dynamic Systems,” Journal of Optimization Theory and Applications, 94, 
29-54, 1997. 

[6] Dembo, A., and O. Zeitouni, Large Deviations Techniques and Applications, 2nd edition, 
Springer-Verlag, 1998. 

[7] Gürkan, G., A.Y. Özge, and S.M. Robinson, “Sample-path Solution of Stochastic Variational
Inequalities,” Mathematical Programming, 84, 313-333, 1999. 

[8] Ho, Y.C., C.G. Cassandras, C.H. Chen, and L.Y. Dai, “Ordinal Optimization and Simulation,” 
Journal of Operations Research Society, 51, 490-500, 2000. 

[9] Ho, Y.C., R. Sreenivas, and P. Vakili, “Ordinal Optimization of DEDS,” Discrete Event Dynamic 
Systems: Theory and Applications, 2, 61-88, 1992. 

[10] Puterman, M.L., Markov Decision Processes, John Wiley & Sons, New York, 1994. 

[11] Robinson, S.M., “Analysis of Sample Path Optimization,” Mathematics of Operations Research, 
21, 513-528, 1996. 

[12] Xie, X., “Dynamics and Convergence Rate of Ordinal Comparison of Stochastic Discrete-Event 
Systems,” IEEE Transactions on Automatic Control, 42, No. 4, 586-590, 1997. 





